individual gradient
A Practical Debugging Tool for the Training of Deep Neural Networks Supplementary Material Checklist
Do the main claims made in the abstract and introduction accurately reflect the paper's
Did you describe the limitations of your work?
Did you discuss any potential negative societal impacts of your work? In general, we believe this work will have an overall positive impact, as it can help shed light on the black box that is deep learning.
If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] All experimental results, as well as the complete code base to reproduce them can be
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
Breaking Secure Aggregation: Label Leakage from Aggregated Gradients in Federated Learning
Wang, Zhibo, Chang, Zhiwei, Hu, Jiahui, Pang, Xiaoyi, Du, Jiacheng, Chen, Yongle, Ren, Kui
Federated Learning (FL) exhibits privacy vulnerabilities under gradient inversion attacks (GIAs), which can extract private information from individual gradients. To enhance privacy, FL incorporates Secure Aggregation (SA) to prevent the server from obtaining individual gradients, thus effectively resisting GIAs. In this paper, we propose a stealthy label inference attack to bypass SA and recover individual clients' private labels. Specifically, we conduct a theoretical analysis of label inference from the aggregated gradients that are exclusively obtained after implementing SA. The analysis results reveal that the inputs (embeddings) and outputs (logits) of the final fully connected layer (FCL) contribute to gradient disaggregation and label restoration. To preset the embeddings and logits of FCL, we craft a fishing model by solely modifying the parameters of a single batch normalization (BN) layer in the original model. Distributing client-specific fishing models, the server can derive the individual gradients regarding the bias of FCL by resolving a linear system with expected embeddings and the aggregated gradients as coefficients. Then the labels of each client can be precisely computed based on preset logits and gradients of FCL's bias. Extensive experiments show that our attack achieves large-scale label recovery with 100% accuracy on various datasets and model architectures.
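The last step of the abstract, computing labels from known logits and the gradient of the FCL bias, rests on a standard identity for softmax cross-entropy: for a single sample, the bias gradient equals the softmax probabilities minus the one-hot label. The sketch below illustrates only that single-sample identity. It is not the authors' attack: it does not model secure aggregation, the fishing model, or the linear system, and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bias_gradient(logits, label):
    """Gradient of softmax cross-entropy w.r.t. the final-layer bias
    for ONE sample: softmax(logits) - one_hot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

def recover_label(logits, grad_bias):
    """Invert the identity: one_hot(label) = softmax(logits) - grad_bias.
    Assumes the logits are known to whoever observes the gradient."""
    probs = softmax(logits)
    one_hot = [p - g for p, g in zip(probs, grad_bias)]
    return max(range(len(one_hot)), key=lambda i: one_hot[i])

# Hypothetical preset logits for a 4-class problem.
logits = [0.5, -1.2, 2.0, 0.1]
grad = bias_gradient(logits, label=1)
recovered = recover_label(logits, grad)
```

With a batch instead of a single sample, the same identity gives summed probabilities minus per-class label counts, which is why knowing the logits in advance matters.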
Cockpit: A Practical Debugging Tool for Training Deep Neural Networks
Schneider, Frank, Dangel, Felix, Hennig, Philipp
When engineers train deep learning models, they are very much "flying blind". Commonly used approaches for real-time training diagnostics, such as monitoring the train/test loss, are limited. Assessing a network's training process solely through these performance indicators is akin to debugging software without access to internal states through a debugger. To address this, we present Cockpit, a collection of instruments that enable a closer look into the inner workings of a learning machine, and a more informative and meaningful status report for practitioners. It facilitates the identification of learning phases and failure modes, like ill-chosen hyperparameters. These instruments leverage novel higher-order information about the gradient distribution and curvature, which has only recently become efficiently accessible. We believe that such a debugging tool, which we open-source for PyTorch, represents an important step to improve troubleshooting the training process, reveal new insights, and help develop novel methods and heuristics.
BackPACK: Packing more into backprop
Dangel, Felix, Kunstner, Frederik, Hennig, Philipp
Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-learning software does not support their automatic calculation. Manually implementing them is burdensome, inefficient if done naïvely, and the resulting code is rarely shared. This hampers progress in deep learning, and unnecessarily narrows research to focus on gradient descent and its variants; it also complicates replication studies and comparisons between newly developed methods that require those quantities, to the point of impossibility. BackPACK's capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and an example application by testing several recent curvature approximations for optimization.
The success of deep learning and the applications it fuels can be traced to the popularization of automatic differentiation frameworks. However, this specialization also has its shortcomings: it assumes the user only wants to compute gradients or, more precisely, the average of gradients across a mini-batch of examples. Other quantities can also be computed with automatic differentiation at a comparable cost or minimal overhead to the gradient backpropagation pass; for example, approximate second-order information or the variance of gradients within the batch. These quantities are valuable to understand the geometry of deep neural networks, for the identification of free parameters, and to push the development of more efficient optimization algorithms.
But researchers who want to investigate their use face a chicken-and-egg problem: automatic differentiation tools required to go beyond standard gradient methods are not available, but there is no incentive for their implementation in existing deep-learning software as long as no large portion of the users need it. Second-order methods for deep learning have been continuously investigated for decades (e.g., Becker & Le Cun, 1989; Amari, 1998; Bordes et al., 2009; Martens & Grosse, 2015).
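To make the distinction between the average mini-batch gradient and the individual gradients concrete, here is a minimal sketch in plain Python. It is an illustration only, not BackPACK's API: the one-parameter model, the squared loss, and all function names are assumptions chosen so the per-sample gradients can be written by hand.

```python
def per_sample_gradients(w, xs, ys):
    """Individual gradients for the assumed model y_hat = w * x with
    loss_i = 0.5 * (w * x_i - y_i) ** 2, so d loss_i / dw = (w * x_i - y_i) * x_i."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def mean(values):
    return sum(values) / len(values)

def gradient_mean_and_variance(w, xs, ys):
    """Return the average mini-batch gradient (what autodiff frameworks
    usually report) and the variance of the individual gradients."""
    grads = per_sample_gradients(w, xs, ys)
    g_mean = mean(grads)
    g_var = mean([(g - g_mean) ** 2 for g in grads])
    return g_mean, g_var

# Hypothetical mini-batch of three examples.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
g_mean, g_var = gradient_mean_and_variance(1.0, xs, ys)
```

Here the individual gradients are -1, -4, and -12. A framework that returns only their mean discards the spread that the variance captures.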
An elegant way to represent forward propagation and back propagation in a neural network
Sometimes you see a diagram and it gives you an 'aha' moment. I saw this one on Frederick Kratzert's blog. Using the input variables x and y, the forward pass (left half of the figure) calculates the output z as a function of x and y, i.e. z = f(x, y). The right half of the figure shows the backward pass. Receiving dL/dz (the derivative of the total loss with respect to the output z), we can calculate the gradients of the loss with respect to x and y by applying the chain rule, as shown in the figure: dL/dx = dL/dz · dz/dx and dL/dy = dL/dz · dz/dy. This post is a part of my forthcoming book on Mathematical foundations of Data Science. The goal of training the neural network is to minimise the loss function for the whole network of neurons. Hence, the problem of solving equations represented by the neural network also becomes a problem of minimising the loss function for the entire network.
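The forward and backward pass described above can be sketched for a single node. The post leaves f(x, y) generic, so the choice f(x, y) = x * y below is an assumption made purely for illustration, as are the input values and the incoming gradient dL/dz.

```python
def forward(x, y):
    """Forward pass of one node: z = f(x, y), here assumed to be x * y."""
    return x * y

def backward(x, y, dL_dz):
    """Backward pass: apply the chain rule to the incoming gradient dL/dz.
    For z = x * y, the local derivatives are dz/dx = y and dz/dy = x."""
    dL_dx = dL_dz * y
    dL_dy = dL_dz * x
    return dL_dx, dL_dy

# Hypothetical inputs and upstream gradient.
z = forward(3.0, -4.0)
dL_dx, dL_dy = backward(3.0, -4.0, dL_dz=2.0)
```

The node only needs its own inputs and the gradient arriving from downstream, so a whole network can be differentiated by chaining such local steps.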
Learning Step Size Controllers for Robust Neural Network Training
Daniel, Christian (TU Darmstadt) | Taylor, Jonathan (Microsoft Research) | Nowozin, Sebastian (Microsoft Research)
This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Starting with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the initial learning rate which has to be chosen by the experimenter. We investigate several features and show how an adaptive controller can adjust the learning rate without prior knowledge of the learning problem at hand.